Analyzing the community activity for version control systems

Context

You are a new team member in a software company
The developers there are using CVS (Concurrent Versions System)
You propose Git as an alternative to the team.

Find evidence that shows that the software development community is mainly adopting the Git version control system!

The Dataset

There is a dataset Stack Overflow available with the following data:

CreationDate: the timestamp of the creation date of a Stack Overflow post (= question)
TagName: the tag name for a technology (in our case for only 4 VCSes: "cvs", "svn", "git" and "mercurial")
ViewCount: the numbers of views of a post

These are the first 10 entries of this dataset:

CreationDate,TagName,ViewCount
2008-08-01 13:56:33,svn,10880
2008-08-01 14:41:24,svn,55075
2008-08-01 15:22:29,svn,15144
2008-08-01 18:00:13,svn,8010
2008-08-01 18:33:08,svn,92006
2008-08-01 23:29:32,svn,2444
2008-08-03 22:38:29,svn,871830
2008-08-03 22:38:29,git,871830
2008-08-04 11:37:24,svn,17969

Analysis

Step 1: Load in the dataset



In [1]:

    
import pandas as pd

vcs_data = pd.read_csv('../dataset/stackoverflow_vcs_data_subset.gz')
vcs_data.head()









    Out[1]:







  
    
      
      CreationDate
      TagName
      ViewCount
    
  
  
    
      0
      2008-08-01 13:56:33
      svn
      10880
    
    
      1
      2008-08-01 14:41:24
      svn
      55075
    
    
      2
      2008-08-01 15:22:29
      svn
      15144
    
    
      3
      2008-08-01 18:00:13
      svn
      8010
    
    
      4
      2008-08-01 18:33:08
      svn
      92006

Step 2: Convert the `CreationDate` column to a real datetime datatype



In [2]:

    
vcs_data['CreationDate'] = pd.to_datetime(vcs_data['CreationDate'])
vcs_data.head()









    Out[2]:







  
    
      
      CreationDate
      TagName
      ViewCount
    
  
  
    
      0
      2008-08-01 13:56:33
      svn
      10880
    
    
      1
      2008-08-01 14:41:24
      svn
      55075
    
    
      2
      2008-08-01 15:22:29
      svn
      15144
    
    
      3
      2008-08-01 18:00:13
      svn
      8010
    
    
      4
      2008-08-01 18:33:08
      svn
      92006

Step 3: Sum up the number of views in `ViewCount` by the timestamp and the VCSes



In [3]:

    
number_of_views = vcs_data.groupby(['CreationDate', 'TagName']).sum()
number_of_views.head()









    Out[3]:







  
    
      
      
      ViewCount
    
    
      CreationDate
      TagName
      
    
  
  
    
      2008-08-01 13:56:33
      svn
      10880
    
    
      2008-08-01 14:41:24
      svn
      55075
    
    
      2008-08-01 15:22:29
      svn
      15144
    
    
      2008-08-01 18:00:13
      svn
      8010
    
    
      2008-08-01 18:33:08
      svn
      92006

Step 4: List the number of views for each VCS in separate columns



In [4]:

    
views_per_vcs = number_of_views.unstack()['ViewCount']
views_per_vcs.head()









    Out[4]:







  
    
      TagName
      cvs
      git
      mercurial
      svn
    
    
      CreationDate
      
      
      
      
    
  
  
    
      2008-08-01 13:56:33
      NaN
      NaN
      NaN
      10880.0
    
    
      2008-08-01 14:41:24
      NaN
      NaN
      NaN
      55075.0
    
    
      2008-08-01 15:22:29
      NaN
      NaN
      NaN
      15144.0
    
    
      2008-08-01 18:00:13
      NaN
      NaN
      NaN
      8010.0
    
    
      2008-08-01 18:33:08
      NaN
      NaN
      NaN
      92006.0

Step 5: Accumulate the number of views for the VCSes for every month



In [5]:

    
monythly_views = views_per_vcs.resample("1M").sum().cumsum()
monythly_views.head()









    Out[5]:







  
    
      TagName
      cvs
      git
      mercurial
      svn
    
    
      CreationDate
      
      
      
      
    
  
  
    
      2008-08-31
      165811.0
      2502424.0
      755846.0
      3488439.0
    
    
      2008-09-30
      428713.0
      15224415.0
      891127.0
      11405500.0
    
    
      2008-10-31
      637236.0
      25672697.0
      1114432.0
      14295198.0
    
    
      2008-11-30
      752735.0
      32492475.0
      1184003.0
      15848089.0
    
    
      2008-12-31
      878710.0
      37829614.0
      1222065.0
      18256764.0

Step 6: Visualize the monthly views over time for all VCSes



In [6]:

    
%matplotlib inline
monythly_views.plot(title="monthly stackoverflow post views");

Discussion

What are your conclusions? Discuss!

	CreationDate	TagName	ViewCount
0	2008-08-01 13:56:33	svn	10880
1	2008-08-01 14:41:24	svn	55075
2	2008-08-01 15:22:29	svn	15144
3	2008-08-01 18:00:13	svn	8010
4	2008-08-01 18:33:08	svn	92006

		ViewCount
CreationDate	TagName
2008-08-01 13:56:33	svn	10880
2008-08-01 14:41:24	svn	55075
2008-08-01 15:22:29	svn	15144
2008-08-01 18:00:13	svn	8010
2008-08-01 18:33:08	svn	92006

TagName	cvs	git	mercurial	svn
CreationDate
2008-08-01 13:56:33	NaN	NaN	NaN	10880.0
2008-08-01 14:41:24	NaN	NaN	NaN	55075.0
2008-08-01 15:22:29	NaN	NaN	NaN	15144.0
2008-08-01 18:00:13	NaN	NaN	NaN	8010.0
2008-08-01 18:33:08	NaN	NaN	NaN	92006.0

TagName	cvs	git	mercurial	svn
CreationDate
2008-08-31	165811.0	2502424.0	755846.0	3488439.0
2008-09-30	428713.0	15224415.0	891127.0	11405500.0
2008-10-31	637236.0	25672697.0	1114432.0	14295198.0
2008-11-30	752735.0	32492475.0	1184003.0	15848089.0
2008-12-31	878710.0	37829614.0	1222065.0	18256764.0